CodeSage: Code Representation Learning At Scale

Dejiao Zhang*, Wasi Uddin Ahmad*,
Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang
AWS AI Labs

CodeSage embedding models generate representations that measure the semantic relatedness of source code snippets. CodeSage outperforms OpenAI text-embedding-ada-002 and text-embedding-3-large on Code2Code search tasks, and performs on par with text-embedding-3-large on NL2Code search.

Summary


What is CodeSage?

CodeSage is a family of source code embedding models built on a Transformer encoder architecture. It supports a wide range of source code understanding tasks and is available in three sizes: 130M (CodeSage-Small), 356M (CodeSage-Base), and 1.3B (CodeSage-Large).

How is CodeSage trained?

CodeSage is trained on The Stack dataset in two stages. In stage 1, we perform masked language modeling (MLM) with a mix of standard masking and identifier deobfuscation. In stage 2, we perform bimodal contrastive learning on text-code pairs constructed from functions and their docstrings.

How good is CodeSage?

Our largest model, CodeSage-Large, outperforms OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large by 41%, 144%, and 34% (relative), respectively, on code-to-code search tasks. On text-to-code search tasks, CodeSage-Large outperforms text-embedding-ada-002 and text-embedding-3-small, and is on par with text-embedding-3-large.

Training Recipe


Step 1 - Masked Language Modeling

  • Random Masking: We do not follow the 80-10-10 masking convention used in standard MLM for text. Since source code is composed of NL and code tokens (i.e., identifiers, keywords, operators), randomly replacing tokens can corrupt both the structure and meaning of the code and degrade representation learning, so we replace all sampled tokens with [MASK]. We set the masking rate to 15%, which our ablation study finds to be optimal.
  • Identifier Deobfuscation: We also use identifier deobfuscation (DOBF), which trains the model to predict the masked-out names of identifiers. The main challenge in adopting DOBF for encoder-only models is constructing the one-to-one mapping between mask tokens (inputs to the LM) and identifier tokens (output labels), owing to the mismatch between code tokenization (i.e., using tree-sitter) and model-specific tokenization (i.e., using a sentencepiece tokenizer). We briefly discuss this challenge in the paper; a simplified illustration of the obfuscation step follows this list.
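
To make the obfuscation step concrete, here is a minimal, simplified sketch using Python's built-in ast module. The actual pipeline parses code with tree-sitter and must further align identifier spans with sentencepiece subtokens, which is the challenge mentioned above; the placeholder names (FUNC_i, VAR_i) and the ast-based renaming here are illustrative assumptions, not the exact implementation.

import ast

class Obfuscator(ast.NodeTransformer):
    """Replace function and variable names with placeholders; the model is then trained
    to recover the original names (the values of `mapping`) from the obfuscated code."""

    def __init__(self):
        self.mapping = {}  # placeholder -> original identifier (deobfuscation labels)

    def _placeholder(self, name, prefix):
        for ph, original in self.mapping.items():
            if original == name:
                return ph                       # reuse the placeholder for repeated names
        ph = f"{prefix}_{len(self.mapping)}"
        self.mapping[ph] = name
        return ph

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name, "FUNC")
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        node.id = self._placeholder(node.id, "VAR")
        return node

source = "def add(a, b):\n    total = a + b\n    return total"
obfuscator = Obfuscator()
print(ast.unparse(obfuscator.visit(ast.parse(source))))  # e.g., def FUNC_0(VAR_1, VAR_2): ...
print(obfuscator.mapping)  # {'FUNC_0': 'add', 'VAR_1': 'a', 'VAR_2': 'b', 'VAR_3': 'total'}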

Step 2 - Bimodal Contrastive Learning

  • Hard Negatives: For each anchor within a mini-batch, hard negatives are examples that are semantically different from the anchor yet mapped close to it in the representation space. We give these hard negatives higher weights, using the distance-based unsupervised approximation of hard negatives proposed in Zhang et al. (2021); see the sketch after this list.
  • Hard Positives: We consider naturally occurring (text, function) pairs as positives, where the text is mined from the function's docstring. We form hard positives by removing both the function signature and the return statements. We refer readers to the paper for the reasoning behind this design of hard positives.
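
Below is a minimal sketch of how hard-negative weighting can sit on top of a standard InfoNCE objective over in-batch negatives: each negative's contribution is scaled by its (detached) similarity to the anchor, so closer, harder negatives count more. The exact weighting follows Zhang et al. (2021); the temperature and beta values here are illustrative assumptions.

import torch
import torch.nn.functional as F

def hard_negative_weighted_nce(text_emb, code_emb, temperature=0.05, beta=1.0):
    """Bimodal InfoNCE over in-batch negatives with distance-based hard-negative weights.

    text_emb, code_emb: [batch, dim] embeddings; (text_emb[i], code_emb[i]) is a positive
    (docstring, function) pair, and all other codes in the batch serve as negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    sim = text_emb @ code_emb.t() / temperature            # [batch, batch] scaled similarities
    batch_size = sim.size(0)
    pos_mask = torch.eye(batch_size, dtype=torch.bool, device=sim.device)

    # Negatives closer to the anchor (harder) receive larger importance weights;
    # the weights are detached so they rescale, rather than redirect, the gradient.
    weights = torch.softmax(beta * sim.detach().masked_fill(pos_mask, float("-inf")), dim=-1)

    pos = sim.diagonal()
    neg = (batch_size - 1) * (weights * sim.exp()).sum(dim=-1)
    loss = -(pos - torch.log(pos.exp() + neg))              # reduces to standard InfoNCE for uniform weights
    return loss.mean()

# Toy usage with random stand-in embeddings.
text, code = torch.randn(8, 1024), torch.randn(8, 1024)
print(hard_negative_weighted_nce(text, code))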

Evaluation Results

We benchmark CodeSage against public encoder models for code (CodeBERT, GraphCodeBERT, StarEncoder, UniXcoder) and the OpenAI embedding models (text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large). Below we show the evaluation results on:

  • Code2Code search: Given a code fragment as the query, retrieve other relevant code fragments.
  • NL2Code search: Given a piece of natural language text as the query, retrieve relevant code.

Table 1: MAP score (%) on the zero-shot Code2Code search task.
Table 2: MRR score (%) on zero-shot NL2Code search.
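
For context, here is a rough sketch of how a zero-shot retrieval score such as MRR can be computed (illustrative, not the exact benchmark code): embed queries and candidates, rank candidates by cosine similarity, and average the reciprocal rank of the gold answer. MAP on Code2Code search generalizes this to multiple relevant candidates per query.

import torch
import torch.nn.functional as F

def mean_reciprocal_rank(query_embs, code_embs):
    """Zero-shot NL2Code retrieval where query i's gold answer is code i.

    query_embs, code_embs: [num_queries, dim] sentence-level embeddings.
    Candidates are ranked by cosine similarity; MRR averages 1 / rank of the gold code.
    """
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(code_embs, dim=-1)
    sim = q @ c.t()                              # [num_queries, num_candidates]
    gold = sim.diagonal().unsqueeze(-1)
    ranks = 1 + (sim > gold).sum(dim=-1)         # 1 + number of candidates ranked above the gold code
    return (1.0 / ranks.float()).mean().item()

# Toy usage with random stand-in embeddings (replace with real model outputs).
queries, codes = torch.randn(100, 1024), torch.randn(100, 1024)
print(f"MRR: {mean_reciprocal_rank(queries, codes):.4f}")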

Key Insights


Masking Strategy

On the 80-10-10 corruption convention. Given an input sequence, the conventional strategy for text first randomly samples a subset of its tokens, of which 80% are replaced by the special token [MASK], 10% are left unchanged, and the remaining 10% are replaced by random tokens from the vocabulary. We find this 80-10-10 strategy suboptimal for code; it is more effective to simply replace all sampled tokens with [MASK].
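
The two corruption schemes are easy to contrast in code. Below is a minimal sketch that applies either the 80-10-10 split or the full-[MASK] replacement to a batch of token IDs; the token IDs, mask ID, and vocabulary size are placeholders rather than the actual CodeSage vocabulary.

import torch

def corrupt(input_ids, mask_token_id, vocab_size, mask_rate=0.15, full_mask=True):
    """Sample `mask_rate` of positions and corrupt them for MLM.

    full_mask=True : replace every sampled position with [MASK] (more effective for code).
    full_mask=False: conventional 80-10-10 split ([MASK] / unchanged / random token).
    """
    labels = input_ids.clone()
    sampled = torch.rand_like(input_ids, dtype=torch.float) < mask_rate
    labels[~sampled] = -100                               # loss is computed on sampled positions only

    corrupted = input_ids.clone()
    if full_mask:
        corrupted[sampled] = mask_token_id
    else:
        split = torch.rand_like(input_ids, dtype=torch.float)
        corrupted[sampled & (split < 0.8)] = mask_token_id
        random_ids = torch.randint_like(input_ids, vocab_size)
        use_random = sampled & (split >= 0.9)             # 0.8 <= split < 0.9 stays unchanged
        corrupted[use_random] = random_ids[use_random]
    return corrupted, labels

# Toy usage with placeholder token IDs.
ids = torch.randint(5, 1000, (1, 16))
print(corrupt(ids, mask_token_id=4, vocab_size=1000, full_mask=True))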


Random Masking and DOBF Complement Each Other. DOBF encourages the model to better understand the structure of code and yields better shared representations between NL and PL. At the same time, random masking encourages the model to learn beyond identifiers; for example, only about 30% of the PL tokens in Python are associated with identifiers. We explored two ways of combining DOBF (D) and random masking (R): (1) Sequential (S): train the model with random masking first, then with DOBF. (2) Parallel (P): for each training example, randomly pick either DOBF or random masking; a sketch of this scheme follows below.
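
A minimal sketch of the Parallel (P) scheme, assuming a 50/50 split between the two objectives; the split ratio and the two corruption callables are illustrative placeholders, with the corruption steps themselves sketched earlier.

import random

def parallel_objective(example, dobf_fn, random_mask_fn, dobf_prob=0.5):
    """Parallel (P): each training example is corrupted by exactly one objective,
    chosen at random per example (dobf_prob=0.5 is an illustrative assumption)."""
    if random.random() < dobf_prob:
        return dobf_fn(example)         # identifier deobfuscation
    return random_mask_fn(example)      # random masking with [MASK]

# Toy usage with placeholder corruption functions.
example = "def add(a, b):\n    return a + b"
print(parallel_objective(example, dobf_fn=lambda x: ("DOBF", x), random_mask_fn=lambda x: ("MLM", x)))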

Effectiveness of Contrastive Learning

  • Hard positives and negatives effectively boost performance.
  • Bimodal contrastive learning with text-code pairs outperforms dropout-based unimodal contrastive learning.

Try CodeSage on Hugging Face!

from transformers import AutoModel, AutoTokenizer

checkpoint = "codesage/codesage-small"  # "codesage/codesage-base", "codesage/codesage-large"
device = "cuda"                         # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
print(f'Dimension of the embedding: {embedding[0].size()}')
# Dimension of the embedding: torch.Size([13, 1024])
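
The output above is a sequence of token-level hidden states ([13, 1024] for the 13 input tokens). To compare two snippets you need one vector per input; continuing from the snippet above, the sketch below mean-pools the token embeddings and compares the pooled vectors with cosine similarity. Mean pooling is a common choice but an assumption here; check the model card for the pooling used to produce the reported benchmark numbers.

import torch
import torch.nn.functional as F

def embed(code_snippet):
    # Mean-pool the token-level hidden states into a single embedding vector.
    inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
    with torch.no_grad():
        token_embeddings = model(inputs)[0]            # [1, seq_len, hidden_dim]
    return token_embeddings.mean(dim=1).squeeze(0)     # [hidden_dim]

a = embed("def add(a, b):\n    return a + b")
b = embed("def multiply(x, y):\n    return x * y")
print(f"Cosine similarity: {F.cosine_similarity(a, b, dim=0).item():.4f}")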

BibTeX

@inproceedings{
    zhang2024codesage,
    title={CodeSage: Code Representation Learning At Scale},
    author={Dejiao Zhang* and Wasi Ahmad* and Ming Tan and Hantian Ding and Ramesh Nallapati and Dan Roth and Xiaofei Ma and Bing Xiang},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=vfzRRjumpX}
}